Abstract: We study the asymptotic properties of the SCAD-penalized least
squares estimator in sparse, high-dimensional, linear regression models when
the number of covariates may increase with the sample size. We are particularly
interested in the use of this estimator for simultaneous variable selection and
estimation. We show that under appropriate conditions, the SCAD-penalized
least squares estimator is consistent for variable selection and that the
estimators of nonzero coefficients have the same asymptotic distribution as they
would have if the zero coefficients were known in advance. Simulation studies
indicate that this estimator performs well in terms of variable selection and
estimation.



1. Introduction 

Consider a linear regression model

\[ Y = \beta_0 + X'\beta + \varepsilon, \]

where $\beta$ is a $p \times 1$ vector of regression coefficients associated with $X$. We are
interested in estimating $\beta$ when $p \to \infty$ as the sample size $n \to \infty$ and when $\beta$
is sparse, in the sense that many of its elements are zero. This is motivated by
biomedical studies investigating the relationship between a phenotype of interest
and genomic covariates such as microarray data. In many cases, it is reasonable
to assume a sparse model, because the number of important covariates is usually
relatively small, although the total number of covariates can be large.

We use the SCAD method to achieve variable selection and estimation of $\beta$
simultaneously. The SCAD method was proposed by Fan and Li [1] in a general
parametric framework for variable selection and efficient estimation. This method
uses a specially designed penalty function, the smoothly clipped absolute deviation
(hence the name SCAD). Compared to classical variable selection methods such
as subset selection, the SCAD has two advantages. First, variable selection with
SCAD is continuous and hence more stable than subset selection, which is a
discrete and non-continuous process. Second, the SCAD is computationally feasible
for high-dimensional data. In contrast, computation in subset selection is
combinatorial and not feasible when $p$ is large. In addition to the SCAD method, several
other penalized methods have also been proposed to achieve variable selection and
estimation simultaneously. Examples include the bridge penalty (Frank and
Friedman), LASSO (Tibshirani [11]), and the Elastic-Net (Enet) penalty (Zou and
Hastie), among others.

Fan and Li [1] and Fan and Peng [2] studied asymptotic properties of SCAD
penalized likelihood methods. Their results are concerned with local maximizers of
the penalized likelihood, but not the maximum penalized estimators. These results
do not imply existence of an estimator with the properties of the local maximizer
without auxiliary information about the true parameter value. Therefore, they are
not applicable to the SCAD-penalized maximum likelihood estimators, nor to the
SCAD-penalized estimator. Knight and Fu [7] studied the asymptotic distributions
of bridge estimators when the number of covariates is fixed. Huang, Horowitz and
Ma [3] studied the bridge estimators with a divergent number of covariates in a
linear regression model. They showed that the bridge estimators have an oracle
property under appropriate conditions if the bridge index is strictly between 0 and 1.
Several earlier studies have investigated the properties of regression estimators with
a divergent number of covariates. See, for example, Huber [5] and Portnoy.
Portnoy proved consistency and asymptotic normality of a class of M-estimators of
regression parameters under appropriate conditions. However, he did not consider
penalized regression or selection of variables in sparse models.

In this paper, we study the asymptotic properties of the SCAD-penalized least
squares estimator, abbreviated as LS-SCAD estimator henceforth. We show that
the LS-SCAD estimator can correctly select the nonzero coefficients with
probability converging to one and that the estimators of the nonzero coefficients are
asymptotically normal with the same means and covariances as they would have if
the zero coefficients were known in advance. Thus, the LS-SCAD estimators have
an oracle property in the sense of Fan and Li [1] and Fan and Peng [2]. In other
words, this estimator is asymptotically as efficient as the ideal estimator assisted
by an oracle who knows which coefficients are nonzero and which are zero.

The rest of this article is organized as follows. In Section 2, we define the LS- 
SCAD estimator. The main results for the LS-SCAD estimator are given in Section 
3, including the consistency and oracle properties. Section 4 describes an algorithm 
for computing the LS-SCAD estimator and the criterion to choose the penalty pa- 
rameter. Section 5 offers simulation studies that illustrate the finite sample behavior 
of this estimator. Some concluding remarks are given in Section 6. The proofs are 
relegated to the Appendix. 

2. Penalized regression with the SCAD penalty 

Let $(X_i, Y_i)$, $i = 1, \ldots, n$ be $n$ observations satisfying

\[ Y_i = \beta_0 + X_i^{(n)\prime}\beta^{(n)} + \varepsilon_i, \quad i = 1, \ldots, n, \]

where $Y_i \in R$ is a response variable, $X_i^{(n)}$ is a $p_n \times 1$ covariate vector and $\varepsilon_i$ has mean 0
and variance $\sigma^2$. Here the superscripts are used to make it explicit that both the
covariates and parameters may change with $n$. For simplicity, we assume $\beta_0 = 0$.
Otherwise we can center the covariates and responses first.

In sparse models, the $p_n$ covariates can be classified into two categories: the
important ones whose corresponding coefficients are nonzero and the trivial ones
whose coefficients are zero. For convenience of notation, we write

\[ \beta = (\beta_1', \beta_2')', \]

where $\beta_1' = (\beta_1, \ldots, \beta_{k_n})$ and $\beta_2' = (0, \ldots, 0)$. Here $k_n (\le p_n)$ is the number of
nontrivial covariates. Let $m_n = p_n - k_n$ be the number of zero coefficients. Let
$Y = (Y_1, \ldots, Y_n)'$ and let $X = (X_{ij}, 1 \le i \le n, 1 \le j \le p_n)$ be the $n \times p_n$ design
matrix. According to the partition of $\beta$, write $X = (X_1, X_2)$, where $X_1$ and $X_2$ are
$n \times k_n$ and $n \times m_n$ matrices, respectively.

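Given $a > 2$ and $\lambda > 0$, the SCAD penalty at $\theta$ is

\[ p_\lambda(\theta; a) = \begin{cases}
\lambda|\theta|, & |\theta| \le \lambda, \\
-(\theta^2 - 2a\lambda|\theta| + \lambda^2)/[2(a-1)], & \lambda < |\theta| \le a\lambda, \\
(a+1)\lambda^2/2, & |\theta| > a\lambda.
\end{cases} \]

More insight into it can be gained through its first derivative:

\[ p'_\lambda(\theta; a) = \begin{cases}
\mathrm{sgn}(\theta)\lambda, & |\theta| \le \lambda, \\
\mathrm{sgn}(\theta)(a\lambda - |\theta|)/(a-1), & \lambda < |\theta| \le a\lambda, \\
0, & |\theta| > a\lambda.
\end{cases} \]

The SCAD penalty is continuously differentiable on $(-\infty, 0) \cup (0, \infty)$, but not
differentiable at 0. Its derivative vanishes outside $[-a\lambda, a\lambda]$. As a consequence, SCAD
penalized regression can produce sparse solutions and unbiased estimates for large
coefficients. More detailed discussions of this penalty can be found in Fan and Li
(2001).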
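
As an illustration, the piecewise definitions of the SCAD penalty and its derivative can be written directly in code. The following is a minimal sketch in Python with NumPy; the function names `scad_penalty` and `scad_derivative` are hypothetical and chosen only for this example.

```python
import numpy as np

def scad_penalty(theta, lam, a):
    """SCAD penalty p_lambda(theta; a), evaluated elementwise (requires a > 2, lam > 0)."""
    t = np.abs(np.asarray(theta, dtype=float))
    linear = lam * t                                                  # |theta| <= lam
    quad = -(t**2 - 2.0 * a * lam * t + lam**2) / (2.0 * (a - 1.0))   # lam < |theta| <= a*lam
    const = (a + 1.0) * lam**2 / 2.0                                  # |theta| > a*lam
    return np.where(t <= lam, linear, np.where(t <= a * lam, quad, const))

def scad_derivative(theta, lam, a):
    """First derivative p'_lambda(theta; a) for theta != 0, evaluated elementwise."""
    theta = np.asarray(theta, dtype=float)
    t = np.abs(theta)
    magnitude = np.where(t <= lam, lam,
                         np.where(t <= a * lam, (a * lam - t) / (a - 1.0), 0.0))
    return np.sign(theta) * magnitude
```

The three branches join continuously at $|\theta| = \lambda$ and $|\theta| = a\lambda$, and the derivative is exactly zero beyond $a\lambda$, which is the property behind the unbiasedness for large coefficients noted above.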

The penalized least squares objective function for estimating $\beta$ with the SCAD
penalty is

\[ Q_n(b; \lambda_n, a) = \|Y - Xb\|^2 + n\sum_{j=1}^{p_n} p_{\lambda_n}(b_j; a), \tag{1} \]

where $\|\cdot\|$ is the $L_2$ norm. Given penalty parameters $\lambda_n$ and $a$, the LS-SCAD
estimator of $\beta$ is

\[ \hat\beta_n = \hat\beta(\lambda_n; a) = \arg\min_b Q_n(b; \lambda_n, a). \]

We write $\hat\beta_n = (\hat\beta_{1n}', \hat\beta_{2n}')'$ the way we partition $\beta$ into $\beta_1$ and $\beta_2$.

3. Asymptotic properties of the LS-SCAD estimator 

In this section we state the results on the asymptotic properties of the LS-SCAD
estimator. Results for the case of fixed design are slightly different from those for
the case of random design. We state them separately.

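For convenience, the main assumptions required for conclusions in this section
are listed here. (A0) through (A4) are for fixed covariates. Let $\rho_{n,1}$ be the smallest
eigenvalue of $n^{-1}X'X$. Let $\pi_{n,k_n}$ and $\omega_{n,m_n}$ be the largest eigenvalues of $n^{-1}X_1'X_1$ and
$n^{-1}X_2'X_2$, respectively. Let $\mathbf{X}_{i1} = (X_{i1}, \ldots, X_{ik_n})'$ and $\mathbf{X}_{i2} = (X_{i,k_n+1}, \ldots, X_{ip_n})'$.

(A0) (a) $\varepsilon_i$'s are i.i.d. with mean 0 and variance $\sigma^2$;
     (b) For any $j \in \{1, \ldots, p_n\}$, $\|X_{\cdot j}\|^2 = n$.
(A1) (a) $\lim_{n\to\infty} \sqrt{k_n}\lambda_n/\sqrt{\rho_{n,1}} = 0$;
     (b) $\lim_{n\to\infty} \sqrt{p_n}/\sqrt{n\rho_{n,1}} = 0$.
(A2) (a) $\lim_{n\to\infty} \sqrt{k_n}\lambda_n/(\sqrt{\rho_{n,1}}\min_{1\le j\le k_n}|\beta_j|) = 0$;
     (b) $\lim_{n\to\infty} \sqrt{p_n}/(\sqrt{n\rho_{n,1}}\min_{1\le j\le k_n}|\beta_j|) = 0$;
     (c) $\lim_{n\to\infty} \sqrt{p_n/n}/\rho_{n,1} = 0$.
(A3) $\lim_{n\to\infty} \sqrt{\max(\pi_{n,k_n}, \omega_{n,m_n})p_n}/(\sqrt{n}\rho_{n,1}\lambda_n) = 0$.
(A4) $\lim_{n\to\infty} \max_{1\le i\le n} \mathbf{X}_{i1}'(\sum_{i=1}^n \mathbf{X}_{i1}\mathbf{X}_{i1}')^{-1}\mathbf{X}_{i1} = 0$.

For random covariates, we require conditions (B0) through (B3). Suppose the
$(X_i', \varepsilon_i)$'s are independent and identically distributed as $(X', \varepsilon) = (X_1, \ldots, X_{p_n}, \varepsilon)$.
Analogous to the fixed design case, $\rho_1$ denotes the smallest eigenvalue of $E[XX']$. Also
$\pi_{k_n}$ and $\omega_{m_n}$ are the largest eigenvalues of $E[\mathbf{X}_{i1}\mathbf{X}_{i1}']$ and $E[\mathbf{X}_{i2}\mathbf{X}_{i2}']$, respectively.

(B0) $(X_i', \varepsilon_i) = (X_{i1}, \ldots, X_{ip_n}, \varepsilon_i)$, $i = 1, \ldots, n$ are i.i.d. with
     (a) $E[X_{ij}] = 0$, $\mathrm{Var}(X_{ij}) = 1$;
     (b) $E[\varepsilon|X] = 0$, $\mathrm{Var}(\varepsilon|X) = \sigma^2$.
(B1) (a) $\lim_{n\to\infty} p_n^2/(n\rho_1^2) = 0$;
     (b) $\lim_{n\to\infty} k_n\lambda_n^2/\rho_1 = 0$.
(B2) (a) $\lim_{n\to\infty} \sqrt{p_n}/(\sqrt{n}\rho_1\min_{1\le j\le k_n}|\beta_j|) = 0$;
     (b) $\lim_{n\to\infty} \lambda_n\sqrt{k_n}/(\sqrt{\rho_1}\min_{1\le j\le k_n}|\beta_j|) = 0$.
(B3) $\lim_{n\to\infty} \sqrt{\max(\pi_{k_n}, \omega_{m_n})p_n}/(\sqrt{n}\rho_1\lambda_n) = 0$.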
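
For a given fixed design, the quantities entering conditions (A0) through (A4) can be computed numerically. The sketch below is only an illustration: the helper names `standardize_columns` and `design_quantities` are hypothetical, and it assumes the first `k` columns of `X` are the nontrivial covariates with `0 < k < p`.

```python
import numpy as np

def standardize_columns(X):
    """Rescale each column so that its squared Euclidean norm equals n, as in (A0.b)."""
    n = X.shape[0]
    return X / np.linalg.norm(X, axis=0) * np.sqrt(n)

def design_quantities(X, k):
    """Return (rho_{n,1}, pi_{n,k_n}, omega_{n,m_n}, max leverage) for an n x p design."""
    n = X.shape[0]
    X1, X2 = X[:, :k], X[:, k:]
    rho_min = np.linalg.eigvalsh(X.T @ X / n)[0]         # smallest eigenvalue of X'X/n
    pi_max = np.linalg.eigvalsh(X1.T @ X1 / n)[-1]       # largest eigenvalue of X1'X1/n
    omega_max = np.linalg.eigvalsh(X2.T @ X2 / n)[-1]    # largest eigenvalue of X2'X2/n
    # Leverages X_i1' (X1'X1)^{-1} X_i1; their maximum is the quantity in (A4).
    gram_inv = np.linalg.inv(X1.T @ X1)
    leverage = np.einsum("ij,jk,ik->i", X1, gram_inv, X1)
    return rho_min, pi_max, omega_max, leverage.max()
```

After standardization the trace of $n^{-1}X_1'X_1$ equals $k_n$, so the largest eigenvalue returned for the first block can never exceed $k_n$, and likewise $m_n$ bounds the second.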

Theorem 1 (Consistency in the fixed design setting). Under (A0)-(A1),

\[ \|\hat\beta_n - \beta\| \to_P 0 \quad \text{as } n \to \infty. \]

A similar result holds for the random design case.

Theorem 2 (Consistency in the random design setting). Suppose that there exists
an absolute constant $M_4$ such that for all $n$, $\max_{1\le j\le p_n} E[X_j^4] \le M_4 < \infty$. Then
under (B0)-(B1),

\[ \|\hat\beta_n - \beta\| \to_P 0 \quad \text{as } n \to \infty. \]

For consistency, $\lambda_n$ has to be kept small so that the SCAD penalty would not
introduce any bias asymptotically. Note that in both design settings, the restriction
on the penalty parameter $\lambda_n$ does not involve $m_n$, the number of trivial covariates.
This is shared by the $L_q$ $(0 < q < 1)$-penalized estimators in Huang, Horowitz
and Ma [3]. However, unlike the bridge estimators, no upper bound requirement
is imposed on the components of $\beta_1$, since the derivative of the SCAD penalty
vanishes beyond a certain interval while that of the $L_q$ penalty does not. In the fixed
design case, (A1.b) is needed for model identifiability, as required in the classical
regression. For the random design case, a stricter requirement on $p_n$ is entailed by
the need of the convergence of $n^{-1}X'X$ to $E[XX']$ in the Frobenius norm.

The next two theorems state that the LS-SCAD estimator is consistent for vari- 
able selection. 

Theorem 3 (Variable selection in the fixed design setting). Under (A0)-(A3),
$\hat\beta_{2n} = 0_{m_n}$ with probability tending to 1.

Theorem 4 (Variable selection in the random design setting). Suppose there exists
an absolute constant $M$ such that $\max_{1\le j\le p_n} \|X_j\| \le M < \infty$. Then under (B0)-
(B3), $\hat\beta_{2n} = 0_{m_n}$ with probability tending to 1.

(A2.a) and (A2.b) are identical to (A1.a) and (A1.b), respectively, provided that

\[ \liminf_{n\to\infty} \min_{1\le j\le k_n} |\beta_j| > 0. \]

(B2) has a requirement for $\min_{1\le j\le k_n}|\beta_j|$ similar to (A2). (A3) concerns the largest
eigenvalues of $n^{-1}X_1'X_1$ and $n^{-1}X_2'X_2$. Due to the standardization of covariates,
$\pi_{n,k_n} \le k_n$ and $\omega_{n,m_n} \le m_n$. So (A3) is implied by

\[ \lim_{n\to\infty} \frac{\sqrt{\max(k_n, m_n)\,p_n}}{\sqrt{n}\,\rho_{n,1}\lambda_n} = 0. \]

Likewise, (B3) can be replaced with

\[ \lim_{n\to\infty} \frac{\sqrt{\max(k_n, m_n)\,p_n}}{\sqrt{n}\,\rho_1\lambda_n} = 0. \]

Both (A3) and (B3) require $\lambda_n$ not to converge too fast to 0 in order for the
estimator to be able to "discover" the trivial covariates. It may be of concern whether there
are $\lambda_n$'s that simultaneously satisfy (A1)-(A3) (in the random design setting (B1)-
(B3)) under certain conditions. When $\liminf_{n\to\infty}\rho_{n,1} > 0$ and $\liminf_{n\to\infty}\min_{1\le j\le k_n}|\beta_j| > 0$,
it can be checked that there exists $\lambda_n$ that meets both (A2) and (A3) as
long as $p_n = o(n^{1/3})$. If we further know either that $k_n$ is fixed, or that the largest
eigenvalue of $n^{-1}X'X$ is bounded from above, as is assumed in Fan and Peng [2],
$p_n = o(n^{1/2})$ is sufficient. When both of these are true, $p_n = o(n)$ is adequate
for the existence of such $\lambda_n$'s. Similar conclusions hold for the random design case
except that $p_n = o(n^{1/2})$ is indispensable there.

The advantage of the SCAD penalty is that once the trivial covariates have been
correctly picked out, regression with or without the SCAD penalty will make no
difference to the nontrivial covariates. So it is expected that $\hat\beta_{1n}$ is asymptotically
normally distributed. Let $\{A_n, n = 1, 2, \ldots\}$ be a sequence of matrices of dimension
$d \times k_n$ with full row rank.

Theorem 5 (Asymptotic normality in the fixed design setting). Under (A0)-(A4),

\[ \sqrt{n}\,\Sigma_n^{-1/2}A_n(\hat\beta_{1n} - \beta_1) \to_D N(0_d, I_d), \]

where $\Sigma_n = \sigma^2 A_n(\sum_{i=1}^n \mathbf{X}_{i1}\mathbf{X}_{i1}'/n)^{-1}A_n'$.

Theorem 6 (Asymptotic normality in the random design setting). Suppose that
there exists an absolute constant $M$ such that $\max_{1\le j\le p_n}\|X_j\| \le M < \infty$ and a
$\sigma_4$ such that $E[\varepsilon^4|\mathbf{X}_{11}] \le \sigma_4 < \infty$ for all $n$. Then under (B0)-(B3),

\[ n^{-1/2}\,\Sigma_n^{-1/2}A_nE^{-1/2}[\mathbf{X}_{i1}\mathbf{X}_{i1}']\sum_{i=1}^n \mathbf{X}_{i1}\mathbf{X}_{i1}'(\hat\beta_{1n} - \beta_1) \to_D N(0_d, I_d), \]

where $\Sigma_n = \sigma^2A_nA_n'$.

For the random design, the assumptions for asymptotic normality are no more
than those for variable selection. For the fixed design, however, a Lindeberg-Feller
condition (A4) is needed in addition to (A0)-(A3).
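
To make the fixed-design normalization concrete, the covariance $\Sigma_n = \sigma^2 A_n(X_1'X_1/n)^{-1}A_n'$ and the implied large-sample standard errors of $A_n\hat\beta_{1n}$ can be computed as follows. This is a hedged sketch: the function names are hypothetical, $\sigma^2$ is taken as known, and `X1` is assumed to hold only the nontrivial covariates.

```python
import numpy as np

def oracle_covariance(X1, A, sigma2):
    """Sigma_n = sigma^2 * A (X1'X1 / n)^{-1} A' for a fixed design."""
    n = X1.shape[0]
    gram_inv = np.linalg.inv(X1.T @ X1 / n)
    return sigma2 * A @ gram_inv @ A.T

def asymptotic_standard_errors(X1, A, sigma2):
    """Standard errors of A beta_hat_1 implied by sqrt(n) Sigma_n^{-1/2} A (beta_hat_1 - beta_1) ~ N(0, I)."""
    n = X1.shape[0]
    return np.sqrt(np.diag(oracle_covariance(X1, A, sigma2)) / n)
```

Taking `A` as the identity matrix gives the usual oracle least squares standard errors for the individual nonzero coefficients.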



4. Computation 

We use the algorithm of Hunter and Li [6] to compute the LS-SCAD estimator
for a given $\lambda_n$ and $a$. This algorithm approximates a nonconvex target function
with a convex function locally at each iteration step. We also describe the steps to
compute the approximate standard error of the estimator.

4.1. Computation of the LS-SCAD estimator



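Given $\lambda_n$ and $a$, the target function to be minimized is

\[ Q_n(b; \lambda_n, a) = \sum_{i=1}^n (Y_i - X_i'b)^2 + n\sum_{j=1}^{p_n} p_{\lambda_n}(b_j; a). \]

Hunter and Li [6] propose to minimize its approximation

\[ Q_{n,\xi}(b; \lambda_n, a) = \sum_{i=1}^n (Y_i - X_i'b)^2 + n\sum_{j=1}^{p_n} p_{\lambda_n,\xi}(b_j; a), \qquad
p_{\lambda_n,\xi}(b_j; a) = p_{\lambda_n}(b_j; a) - \xi\int_0^{|b_j|} \frac{p'_{\lambda_n}(t; a)}{\xi + t}\,dt. \]

Around $b_{(k)} = (b_{(k),1}, \ldots, b_{(k),p_n})'$, it can be approximated by

\[ S_{k,\xi}(b; \lambda_n, a) = \sum_{i=1}^n (Y_i - X_i'b)^2
+ n\sum_{j=1}^{p_n}\Big[ p_{\lambda_n,\xi}(b_{(k),j}; a) + \frac{p'_{\lambda_n}(|b_{(k),j}|+; a)}{2(\xi + |b_{(k),j}|)}\,(b_j^2 - b_{(k),j}^2) \Big], \]

where $\xi$ is a very small perturbation to prevent any component of the estimate from
getting stuck at 0. Therefore the one-step estimator starting from $b_{(k)}$ is

\[ b_{(k+1)} = (X'X + nD_\xi(b_{(k)}; \lambda_n, a))^{-1}X'Y, \]

where $D_\xi(b_{(k)}; \lambda_n, a)$ is the diagonal matrix whose diagonal elements are
$\frac{1}{2}p'_{\lambda_n}(|b_{(k),j}|+; a)/(\xi + |b_{(k),j}|)$, $j = 1, \ldots, p_n$. Given the tolerance $\tau$, convergence is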

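claimed when

\[ \Big|\frac{\partial Q_{n,\xi}(b)}{\partial b_j}\Big| < \frac{\tau}{2}, \quad \forall j = 1, \ldots, p_n. \]

And finally the $b_j$'s that satisfy

\[ \Big|\frac{\partial Q_{n,\xi}(b)}{\partial b_j} - \frac{\partial Q_n(b)}{\partial b_j}\Big|
= \frac{n\xi\,p'_{\lambda_n}(|b_j|+; a)}{\xi + |b_j|} > \frac{\tau}{2} \]

are set to 0. A good starting point would be $b_{(0)} = \hat\beta_{LS}$, the least squares estimator.

The perturbation $\xi$ should be kept small so that the difference between $Q_{n,\xi}(\cdot)$ and
$Q_n(\cdot)$ is negligible. Hunter and Li [6] suggest using

\[ \xi = \frac{\tau}{2n\lambda_n}\min\{|b_{(0),j}| : b_{(0),j} \ne 0\}. \]
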
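
The iteration described in this subsection can be sketched in a few lines of Python. This is an illustrative reading of the procedure, not the authors' implementation: the function names `scad_deriv` and `ls_scad_mm` are hypothetical, and the specific forms assumed here (diagonal entries $p'_{\lambda_n}(|b_j|+;a)/(2(\xi+|b_j|))$ for $D_\xi$, the $\tau/2$ rules, and the choice of $\xi$ from the least squares start) follow the perturbed MM scheme of Hunter and Li as understood in this sketch.

```python
import numpy as np

def scad_deriv(t, lam, a):
    """p'_lambda(t+; a) for t >= 0, evaluated elementwise."""
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))

def ls_scad_mm(X, Y, lam, a=3.7, tau=1e-8, max_iter=1000):
    """Perturbed MM iteration for the LS-SCAD objective (illustrative sketch)."""
    n = X.shape[0]
    XtX, XtY = X.T @ X, X.T @ Y
    b = np.linalg.lstsq(X, Y, rcond=None)[0]              # b_(0): least squares start
    # Perturbation xi (assumed form); needs at least one nonzero starting coefficient.
    xi = tau / (2.0 * n * lam) * np.min(np.abs(b[b != 0.0]))
    for _ in range(max_iter):
        # Diagonal of D_xi(b_(k)): p'(|b_j|+) / (2 (xi + |b_j|)).
        d = 0.5 * scad_deriv(np.abs(b), lam, a) / (xi + np.abs(b))
        b = np.linalg.solve(XtX + n * np.diag(d), XtY)    # one-step update
        # Gradient of the perturbed objective Q_{n,xi} at the new iterate.
        w = scad_deriv(np.abs(b), lam, a)
        grad = -2.0 * (XtY - XtX @ b) + n * w * b / (xi + np.abs(b))
        if np.max(np.abs(grad)) < tau / 2.0:
            break
    # Set to zero the components where the perturbed and unperturbed gradients
    # differ by more than tau / 2.
    gap = n * xi * scad_deriv(np.abs(b), lam, a) / (xi + np.abs(b))
    return np.where(gap > tau / 2.0, 0.0, b)
```

Each update is a ridge-type linear solve, so the cost per iteration is that of one $p_n \times p_n$ system. Coefficients larger than $a\lambda_n$ in absolute value receive no shrinkage because the SCAD derivative vanishes there.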

4.2. Standard errors 

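The standard errors for the nonzero coefficient estimates can be obtained via the
approximation

\[ \frac{\partial S_\xi(\hat\beta_{1n}; \lambda_n, a)}{\partial\beta_1} \approx
\frac{\partial S_\xi(\beta_1; \lambda_n, a)}{\partial\beta_1}
+ \frac{\partial^2 S_\xi(\beta_1; \lambda_n, a)}{\partial\beta_1\,\partial\beta_1'}\,(\hat\beta_{1n} - \beta_1). \]

So

\[ \hat\beta_{1n} - \beta_1 \approx -\Big(\frac{\partial^2 S_\xi(\beta_1; \lambda_n, a)}{\partial\beta_1\,\partial\beta_1'}\Big)^{-1}
\frac{\partial S_\xi(\beta_1; \lambda_n, a)}{\partial\beta_1}. \]

Since

\[ \frac{\partial S_\xi(\hat\beta_{1n}; \lambda_n, a)}{\partial\beta_j}
= -2X_{\cdot j}'Y + 2X_{\cdot j}'X_1\hat\beta_{1n} + n\,\frac{\hat\beta_j\,p'_{\lambda_n}(|\hat\beta_j|; a)}{\xi + |\hat\beta_j|}
= \sum_{i=1}^n \Big[ -2X_{ij}Y_i + 2X_{ij}\mathbf{X}_{i1}'\hat\beta_{1n} + \frac{\hat\beta_j\,p'_{\lambda_n}(|\hat\beta_j|; a)}{\xi + |\hat\beta_j|} \Big], \]

letting $U_{ij} = U_{ij}(\xi; \lambda_n, a)$ denote the $i$th summand, we have, for $j, l = 1, \ldots, k_n$,

\[ \mathrm{Cov}\Big( n^{-1/2}\frac{\partial S_\xi(\hat\beta_{1n}; \lambda_n, a)}{\partial\beta_j},\;
n^{-1/2}\frac{\partial S_\xi(\hat\beta_{1n}; \lambda_n, a)}{\partial\beta_l} \Big)
\approx \frac{1}{n}\sum_{i=1}^n U_{ij}U_{il} - \frac{1}{n^2}\sum_{i=1}^n U_{ij}\sum_{i=1}^n U_{il}. \]

Let $C = (C_{jl},\ j, l = 1, \ldots, k_n)$, where

\[ C_{jl} = \frac{1}{n}\sum_{i=1}^n U_{ij}U_{il} - \frac{1}{n^2}\sum_{i=1}^n U_{ij}\sum_{i=1}^n U_{il}. \]

The variance-covariance matrix of the estimates can be approximated by

\[ \widehat{\mathrm{Cov}}(\hat\beta_{1n}) = n\,(X_1'X_1 + nD_\xi(\hat\beta_{1n}; \lambda_n, a))^{-1}\,C\,(X_1'X_1 + nD_\xi(\hat\beta_{1n}; \lambda_n, a))^{-1}. \]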

4.3. Selection of $\lambda_n$

The above computational algorithm is for the case when $\lambda_n$ and $a$ are specified. In
data analysis, they can be selected by minimizing the generalized cross validation
score, which is defined to be

\[ \mathrm{GCV}(\lambda_n, a) = \frac{\|Y - X_1\hat\beta_{1n}\|^2/n}{(1 - p(\lambda_n, a)/n)^2}, \]

where

\[ p(\lambda_n, a) = \mathrm{tr}\big[ X_1\,(X_1'X_1 + nD_0(\hat\beta_{1n}; \lambda_n, a))^{-1}X_1' \big] \]

is the number of effective parameters and $D_0(\hat\beta_{1n}; \lambda_n, a)$ is a submatrix of the
diagonal matrix $D_\xi(\hat\beta_n; \lambda_n, a)$ with $\xi = 0$. By submatrix, we mean the
diagonal of $D_0(\hat\beta_{1n}; \lambda_n, a)$ only contains the elements corresponding to the nontrivial
components in $\beta$. Note that here $X_1$ also only includes the columns of which the
corresponding elements of $\hat\beta_n$ are non-vanishing.

The requirement that $a > 2$ is implied by the SCAD penalty function. Simulation
suggests that the generalized cross validation score does not change much with $a$
given $\lambda$. So to improve computing efficiency, we fix $a = 3.7$, as suggested by Fan
and Li [1].
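
The generalized cross validation score can be evaluated for a fitted coefficient vector as sketched below. The function name `gcv_score` is hypothetical, and the sketch assumes that the diagonal entries of $D_0$ are $p'_{\lambda_n}(|\hat\beta_j|; a)/(2|\hat\beta_j|)$ over the nonzero components and that at least one component of the fit is nonzero.

```python
import numpy as np

def gcv_score(X, Y, b_hat, lam, a=3.7):
    """Generalized cross validation score for a fitted LS-SCAD coefficient vector."""
    n = X.shape[0]
    active = np.flatnonzero(b_hat)            # columns with non-vanishing estimates
    X1, b1 = X[:, active], b_hat[active]
    t = np.abs(b1)
    # SCAD derivative at |b_j|, then the diagonal of D_0 (the xi = 0 case).
    dp = np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))
    D0 = np.diag(0.5 * dp / t)
    hat = X1 @ np.linalg.solve(X1.T @ X1 + n * D0, X1.T)
    p_eff = np.trace(hat)                     # number of effective parameters
    rss = np.sum((Y - X1 @ b1) ** 2)
    return (rss / n) / (1.0 - p_eff / n) ** 2
```

In practice one would evaluate this score over a grid of $\lambda_n$ values, refitting the estimator at each value, and keep the minimizer.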

5. Simulation studies 

In this section we illustrate the LS-SCAD estimator's finite sample properties with 
a simulated example. 

We simulate covariates $X_i$, $i = 1, \ldots, n$ from the multivariate normal
distribution with mean 0 and

\[ \mathrm{Cov}(X_{ij}, X_{il}) = \rho^{|j-l|}, \quad 1 \le j, l \le p. \]

The response $Y_i$ is computed as

\[ Y_i = \sum_{j=1}^p X_{ij}\beta_j + \varepsilon_i, \quad i = 1, \ldots, n, \]

where $\beta_j = j$, $1 \le j \le 4$, $\beta_j = 0$, $5 \le j \le p$, and the $\varepsilon_i$'s are sampled from $N(0, 1)$.
For each $(n, p, \rho) \in \{(100, 10), (500, 40)\} \times \{0, 0.2, 0.5, 0.8\}$, we generated $N = 400$
data sets and used the algorithm in Section 4 to compute the LS-SCAD estimator.
We set the tolerance $\tau$ described in Section 4.1 at 10~^. For comparison we also
apply the ordinary least squares (LS) method, the ordinary least squares method
with model selection based on AIC (abbreviated as AIC), and the ordinary least
squares assuming that $\beta_j = 0$ for $j \ge 5$ are known beforehand (ORA). Note that
this last estimator (ORA) is not feasible in a real data analysis setting. We use it
here as a benchmark in the comparisons.
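
One replication of this simulation design can be generated as follows. The function name `simulate_dataset` is hypothetical, and the random number generator and seed used in the usage line are arbitrary choices for this sketch, not those of the original study.

```python
import numpy as np

def simulate_dataset(n, p, rho, rng):
    """Draw one data set: Cov(X_ij, X_il) = rho^|j-l|, beta_j = j for j <= 4, N(0, 1) errors."""
    idx = np.arange(p)
    cov = float(rho) ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    beta = np.zeros(p)
    beta[:4] = np.arange(1, 5)                 # nonzero coefficients 1, 2, 3, 4
    Y = X @ beta + rng.standard_normal(n)
    return X, Y, beta

# Example: one draw from the (n, p, rho) = (100, 10, 0.5) configuration.
X, Y, beta = simulate_dataset(100, 10, 0.5, np.random.default_rng(1))
```

Repeating the draw 400 times for each of the eight configurations and applying each estimator to every data set gives the Monte Carlo layout described above.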

The results are summarized in Tables 1 and 2. Columns 4 through 7 in Table 1
are the biases of the estimates of $\beta_j$, $j = 1, \ldots, 4$, respectively. In the parentheses
following each of them are the standard deviations of these estimates. Column 8
(K) lists the numbers of estimates of $\beta_j$, $5 \le j \le p$ that are 0, averaged over 400
replications, and their modes are given in Column 9 (K). For LS, an estimate is set
to be 0 if it lies within [-10-^ 10"^].

In Table 1 we see that the LS-SCAD estimates of the nontrivial coefficients have
biases and standard errors comparable to the ORA estimates. This is in line with
Theorems 5 and 6. The average numbers of zero estimates for $\beta_j$ $(j > 4)$, K,
with respect to LS-SCAD are close to $p - 4$, the true number of zero coefficients
among $\beta_j$ $(j > 4)$. As the true number of trivial covariates increases, the LS-SCAD
estimator may be able to discover more trivial ones than AIC. However, there is
more variability in the number of trivial covariates discovered via LS-SCAD than
that via AIC.

Table 2 gives the averages of the estimated standard errors of $\hat\beta_j$, $1 \le j \le 4$
using the SCAD method over the 400 replications. They are obtained based on
the approach described in Section 4.2. They are slightly smaller than the sampling
standard deviations of $\hat\beta_j$, $1 \le j \le 4$, which are given in parentheses in the rows for
LS-SCAD.

Suppose for a data set the estimate of $\beta$ via one of these four approaches is
$\hat\beta$; then the average model error (AME) regarding this approach is computed as
$\sum_{i=1}^n [X_i'(\hat\beta - \beta)]^2$. Box plots for these AME's are given in Figure 1.

Table 1
Simulation example 1, comparison of estimators

(n, p)     $\rho$  Estimator  $\hat\beta_1$     $\hat\beta_2$     $\hat\beta_3$     $\hat\beta_4$     K (mean)  K (mode)
(100, 10)  0       LS          .0007 (.1112)    -.0034 (.0979)    -.0064 (.1127)    -.0024 (.1091)
                   ORA         .0008 (.1074)    -.0054 (.0936)    -.0057 (.1072)    -.0007 (.1040)    6         6
                   AIC         .0007 (.1083)    -.0026 (.1033)    -.0060 (.1156)    -.0019 (.1181)    4.91      5
                   SCAD       -.0006 (.1094)    -.0037 (.0950)    -.0058 (.1094)    -.0014 (.1060)    4.62      5
           0.2     LS         -.0003 (.1051)    -.0028 (.1068)     .0093 (.1157)     .0037 (.1103)
                   ORA        -.0005 (.1010)    -.0031 (.1035)     .0107 (.1131)     .0020 (.1035)    6         6
                   AIC        -.0002 (.1031)    -.0024 (.1063)     .0107 (.1150)     .0021 (.1079)    4.95      5
                   SCAD       -.0025 (.1035)    -.0026 (.1046)     .0104 (.1141)     .0024 (.1066)    4.64      5
           0.5     LS          .0000 (.1177)    -.0007 (.1353)     .0010 (.1438)     .0006 (.1360)
                   ORA        -.0002 (.1129)    -.0072 (.1317)     .0115 (.1393)     .0022 (.1171)    6         6
                   AIC        -.0003 (.1162)    -.0064 (.1338)     .0114 (.1413)     .0017 (.1294)    4.91      5
                   SCAD        .0035 (.1115)    -.0219 (.1404)     .0135 (.1481)     .0006 (.1293)    4.78      5
           0.8     LS         -.0005 (.1916)    -.0229 (.2293)     .0059 (.2319)     .0060 (.2200)
                   ORA        -.0039 (.1835)    -.0196 (.2197)     .0070 (.2250)     .0092 (.1787)    6         6
                   AIC        -.0021 (.1857)    -.0209 (.2235)     .0063 (.2289)     .0013 (.2072)    4.85      6
                   SCAD       -.0038 (.1868)    -.0197 (.2249)     .0062 (.2280)     .0032 (.2024)    4.87      6
(500, 40)  0       LS          .0021 (.0466)    -.0000 (.0475)    -.0010 (.0466)     .0014 (.0439)
                   ORA         .0027 (.0446)    -.0005 (.0453)    -.0003 (.0448)     .0011 (.0426)    36        36
                   AIC         .0023 (.0460)    -.0003 (.0465)    -.0004 (.0453)     .0016 (.0433)    29.91     30
                   SCAD        .0027 (.0447)    -.0004 (.0454)    -.0004 (.0450)     .0013 (.0429)    32.22     35
           0.2     LS          .0018 (.0478)     .0003 (.0478)    -.0014 (.0487)     .0005 (.0437)
                   ORA         .0003 (.0522)    -.0000 (.0465)    -.0010 (.0517)     .0009 (.0458)    36        36
                   AIC         .0024 (.0473)     .0002 (.0471)    -.0014 (.0475)     .0018 (.0436)    29.87     30
                   SCAD        .0028 (.0461)     .0002 (.0460)    -.0011 (.0475)     .0006 (.0433)    32.20     35
           0.5     LS          .0024 (.0542)     .0001 (.0617)     .0050 (.0608)    -.0048 (.0563)
                   ORA         .0027 (.0526)     .0017 (.0581)     .0033 (.0597)    -.0030 (.0488)    36        36
                   AIC         .0031 (.0537)     .0007 (.0603)     .0037 (.0605)    -.0038 (.0526)    29.87     32
                   SCAD        .0025 (.0528)     .0017 (.0587)     .0034 (.0601)    -.0037 (.0494)    31.855    35
           0.8     LS          .0014 (.0788)    -.0012 (.1014)     .0090 (.1000)    -.0077 (.0943)
                   ORA         .0010 (.0761)     .0017 (.0954)     .0060 (.0983)    -.0044 (.0704)    36        36
                   AIC         .0020 (.0776)     .0003 (.0996)     .0066 (.0995)    -.0071 (.0862)    29.56     30
                   SCAD        .0014 (.0773)     .0018 (.0982)     .0059 (.0990)    -.0050 (.0790)    29.38     35


Table 2
Simulated example, standard error estimate

(n, p)              (100, 10)                       (500, 40)
$\rho$              0      0.2    0.5    0.8       0      0.2    0.5    0.8
se($\hat\beta_1$)   .0983  .1005  .1139  .1624     .0442  .0444  .0512  .0735
se($\hat\beta_2$)   .0980  .1028  .1276  .2080     .0443  .0447  .0571  .0940
se($\hat\beta_3$)   .0996  .1027  .1278  .2086     .0442  .0445  .0573  .0940
se($\hat\beta_4$)   .0988  .1006  .1150  .1727     .0441  .0444  .0512  .0764

The LS estimator definitely has the worst performance in terms of AME. This 
becomes more obvious as the number of trivial predictors increases. LS-SCAD out- 
performs AIC in this respect and is comparable to ORA. But it is also seen that the 
AME's of LS-SCAD tend to be more diffuse as p increases. This is also the result 
of more spread-out estimates of the number of trivial covariates. 

Fig 1. Box plots of the average model errors for four estimators: AIC, LS, ORA, and LS-SCAD.
In the top four panels, $(n, p, \rho)$ = (100, 10, 0), (100, 10, 0.2), (100, 10, 0.5), (100, 10, 0.8); and in the
bottom four panels, $(n, p, \rho)$ = (500, 40, 0), (500, 40, 0.2), (500, 40, 0.5), (500, 40, 0.8), where $n$ is the
sample size, $p$ is the number of covariates, and $\rho$ is the correlation coefficient used in generating
the covariate values.

6. Concluding remarks

In this paper, we have studied the asymptotic properties of the LS-SCAD estimator
when the number of covariates and regression coefficients increases to infinity as
$n \to \infty$. We have shown that this estimator can correctly identify zero coefficients
with probability converging to one and that the estimators of nonzero coefficients
are asymptotically normal and oracle efficient. Our results were obtained under the
assumption that the number of parameters is smaller than the sample size. They
are not applicable when the number of parameters is greater than the sample size,
which arises in microarray gene expression studies. In general, the condition that
$p < n$ is needed for identification of the regression parameter and consistent variable
selection. To achieve consistent variable selection in the "large p, small n" case,
certain conditions are required for the design matrix. For example, Huang et al. [3]
showed that, under a partial orthogonality assumption in which the covariates of the
zero coefficients are uncorrelated or only weakly correlated with the covariates of
nonzero coefficients, the univariate bridge estimators are consistent for variable
selection under appropriate conditions. This result also holds for the univariate
LS-SCAD estimator. Indeed, under the partial orthogonality condition, it can be
shown that the simple univariate regression estimator can be used to consistently
distinguish between nonzero and zero coefficients. Finally, we note that our results
are only valid for a fixed sequence of penalty parameters $\lambda_n$. It is an interesting
and difficult problem to show that the asymptotic oracle property also holds for $\lambda_n$
determined by cross validation.

Appendix 

We now give the proofs of the results stated in Section 3. 

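Proof of Theorem 1. By the definition of $\hat\beta_n$, it is necessary that $Q_n(\hat\beta_n) \le Q_n(\beta)$.
It follows that

\[ \begin{aligned}
0 &\ge \|X(\hat\beta_n - \beta)\|^2 - 2\varepsilon'X(\hat\beta_n - \beta) + n\sum_{j=1}^{k_n}\big[p_{\lambda_n}(\hat\beta_j; a) - p_{\lambda_n}(\beta_j; a)\big] \\
&\ge \|X(\hat\beta_n - \beta)\|^2 - 2\varepsilon'X(\hat\beta_n - \beta) - 2^{-1}n(a+1)k_n\lambda_n^2 \\
&= \|[X'X]^{1/2}(\hat\beta_n - \beta) - [X'X]^{-1/2}X'\varepsilon\|^2 - \varepsilon'X[X'X]^{-1}X'\varepsilon - 2^{-1}n(a+1)k_n\lambda_n^2.
\end{aligned} \]

By the $C_r$-inequality (Loeve [8], page 155),

\[ \begin{aligned}
\|[X'X]^{1/2}(\hat\beta_n - \beta)\|^2 &\le 2\|[X'X]^{1/2}(\hat\beta_n - \beta) - [X'X]^{-1/2}X'\varepsilon\|^2 + 2\varepsilon'X[X'X]^{-1}X'\varepsilon \\
&\le 4\varepsilon'X[X'X]^{-1}X'\varepsilon + n(a+1)k_n\lambda_n^2.
\end{aligned} \]

In the fixed design,

\[ \varepsilon'X[X'X]^{-1}X'\varepsilon = E\big[\varepsilon'X[X'X]^{-1}X'\varepsilon\big]O_p(1) = \sigma^2\,\mathrm{tr}(X[X'X]^{-1}X')\,O_p(1) = p_nO_p(1). \]

Since

\[ \|[X'X]^{1/2}(\hat\beta_n - \beta)\|^2 \ge n\rho_{n,1}\|\hat\beta_n - \beta\|^2, \]

we have

\[ \|\hat\beta_n - \beta\| = O_p\Big(\sqrt{\frac{p_n}{n\rho_{n,1}}} + \lambda_n\sqrt{\frac{k_n}{\rho_{n,1}}}\Big) = o_p(1). \qquad \Box \]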



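Proof of Theorem 2. Let $A^{(n)} = (A^{(n)}_{jk})_{j,k=1,\ldots,p_n}$ with $A^{(n)}_{jk} = n^{-1}\sum_{i=1}^n X_{ij}X_{ik} - E[X_{ij}X_{ik}]$.
Let $\rho_1(A^{(n)})$ and $\rho_{p_n}(A^{(n)})$ be the smallest and largest of the
eigenvalues of $A^{(n)}$, respectively. Then by Theorem 4.1 in Wang and Jia [13],

\[ \rho_1(A^{(n)}) \le \rho_{n,1} - \rho_1 \le \rho_{p_n}(A^{(n)}). \]

By the Cauchy inequality and the properties of eigenvalues of symmetric matrices,

\[ \max(|\rho_1(A^{(n)})|, |\rho_{p_n}(A^{(n)})|) \le \|A^{(n)}\|. \]

When (B1.a) holds, $\|A^{(n)}\| = o_p(\rho_1) = o_p(1)$, as is seen for any $\xi > 0$,

\[ P(\|A^{(n)}\|^2 \ge \xi^2\rho_1^2) \le \frac{E\|A^{(n)}\|^2}{\xi^2\rho_1^2}
\le \frac{p_n^2}{\xi^2\rho_1^2}\sup_{1\le j,k\le p_n} E\big[(A^{(n)}_{jk})^2\big] \le \frac{p_n^2}{n\xi^2\rho_1^2}M_4. \]

Since $\rho_1 > 0$ holds for all $n$, $n^{-1}X'X$ is invertible with probability tending to 1.
Following the argument for the fixed design case, with probability tending to 1,

\[ \|[X'X]^{1/2}(\hat\beta_n - \beta)\|^2 \le 4\varepsilon'X[X'X]^{-1}X'\varepsilon + n(a+1)k_n\lambda_n^2. \]

In the random design setting,

\[ E\big[\varepsilon'X[X'X]^{-1}X'\varepsilon\big] = E\big[E[\varepsilon'X[X'X]^{-1}X'\varepsilon \mid X]\big] = \sigma^2E\big[\mathrm{tr}(X[X'X]^{-1}X')\big] = \sigma^2p_n. \]

The rest of the argument remains the same as for the fixed design case and leads
to

\[ \|\hat\beta_n - \beta\| = O_p\Big(\sqrt{\frac{p_n}{n\rho_1}} + \lambda_n\sqrt{\frac{k_n}{\rho_1}}\Big) = o_p(1). \qquad \Box \]
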

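Lemma 1 (Convergence rate in the fixed design setting). Under (A0)-(A2),
$\|\hat\beta_n - \beta\| = O_p(\sqrt{p_n/n}/\rho_{n,1})$.

Proof. In the proof of consistency, we have
$\|\hat\beta_n - \beta\| = O_p(u_n)$, where $u_n = \lambda_n\sqrt{k_n/\rho_{n,1}} + \sqrt{p_n/(n\rho_{n,1})}$.
For any $L_1$, provided that $\|b - \beta\| \le 2^{L_1}u_n$,

\[ \min_{1\le j\le k_n}|b_j| \ge \min_{1\le j\le k_n}|\beta_j| - 2^{L_1}u_n. \]

If (A2) holds, then for $n$ sufficiently large, $u_n/\min_{1\le j\le k_n}|\beta_j| < 2^{-L_1-1}$. It follows
that

\[ \min_{1\le j\le k_n}|b_j| > \min_{1\le j\le k_n}|\beta_j|/2, \]

which further implies that $\min_{1\le j\le k_n}|b_j| > a\lambda_n$ for $n$ sufficiently large (assume
$\liminf_{n\to\infty} k_n > 0$).

Let $\{h_n\}$ be a sequence converging to 0. As in the proof of Theorem 3.2.5
of Van der Vaart and Wellner, decompose $R^{p_n}\setminus\{0_{p_n}\}$ into shells $\{S_{n,l}, l \in Z\}$
where $S_{n,l} = \{b : 2^{l-1}h_n \le \|b - \beta\| < 2^lh_n\}$. For $b \in S_{n,l}$ such that $2^lh_n \le 2^{L_1}u_n$,

\[ \begin{aligned}
Q_n(b) - Q_n(\beta) &= (b - \beta)'X'X(b - \beta) - 2\varepsilon'X(b - \beta) + n\sum_{j=1}^{p_n}p_{\lambda_n}(b_j; a) - n\sum_{j=1}^{p_n}p_{\lambda_n}(\beta_j; a) \\
&\ge (b - \beta)'X'X(b - \beta) - 2\varepsilon'X(b - \beta) \\
&\equiv I_{n1} + I_{n2},
\end{aligned} \]

and

\[ I_{n1} \ge n\rho_{n,1}\|b - \beta\|^2 \ge 2^{2(l-1)}h_n^2n\rho_{n,1}. \]

Thus

\[ \begin{aligned}
P(\|\hat\beta_n - \beta\| \ge 2^Lh_n)
&\le o(1) + \sum_{l > L,\ 2^lh_n \le 2^{L_1}u_n} P(\hat\beta_n \in S_{n,l}) \\
&\le o(1) + \sum_{l > L,\ 2^lh_n \le 2^{L_1}u_n} P\Big(\sup_{b\in S_{n,l}} \varepsilon'X(b - \beta) \ge 2^{2l-3}h_n^2n\rho_{n,1}\Big) \\
&\le o(1) + \sum_{l > L,\ 2^lh_n \le 2^{L_1}u_n} \frac{E\big|\sup_{b\in S_{n,l}} \varepsilon'X(b - \beta)\big|}{2^{2l-3}h_n^2n\rho_{n,1}} \\
&\le o(1) + \sum_{l > L,\ 2^lh_n \le 2^{L_1}u_n} \frac{2^lh_nE^{1/2}[\|\varepsilon'X\|^2]}{2^{2l-3}h_n^2n\rho_{n,1}} \\
&\le o(1) + \sum_{l > L} \frac{2^l\sigma\sqrt{np_n}}{2^{2l-3}h_nn\rho_{n,1}},
\end{aligned} \]

from which we see $\|\hat\beta_n - \beta\| = O_p(\sqrt{p_n/n}/\rho_{n,1})$. $\Box$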



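Lemma 2 (Convergence rate in the random design setting). Under (B0)-(B2),
$\|\hat\beta_n - \beta\| = O_p(\sqrt{p_n/n}/\rho_1)$.

Proof. Deduction is similar to that of Lemma 1. However, since $X$ is a random
matrix in this case, extra details are needed in the following part. Let
$A^{(n)} = (A^{(n)}_{jk})_{j,k=1,\ldots,p_n}$ with $A^{(n)}_{jk} = n^{-1}\sum_{i=1}^n X_{ij}X_{ik} - E[X_jX_k]$. We have

\[ \begin{aligned}
P(\|\hat\beta_n - \beta\| \ge 2^Lh_n)
&\le \sum_{l > L,\ 2^lh_n \le 2^{L_1}u_n} P(\hat\beta_n \in S_{n,l},\ \|A^{(n)}\| \le \rho_1/2) + o(1) \\
&\le \sum_{l > L,\ 2^lh_n \le 2^{L_1}u_n} P\Big(\inf_{b\in S_{n,l}} Q_n(b) \le Q_n(\beta),\ \|A^{(n)}\| \le \rho_1/2\Big) + o(1) \\
&\le \sum_{l > L} \frac{2^lh_nE^{1/2}\big[\|\varepsilon'X\|^2\,1\{\|A^{(n)}\| \le \rho_1/2\}\big]}{2^{2l-4}h_n^2n\rho_1} + o(1).
\end{aligned} \]

The first inequality follows from (B1.a). This leads to $\|\hat\beta_n - \beta\| = O_p(\sqrt{p_n/n}/\rho_1)$. $\Box$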



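Proof of Theorem 3. By Lemma 1, $\|\hat\beta_n - \beta\| \le \lambda_n$ with probability tending to 1
under (A3). Consider the partial derivatives of $Q_n(\beta + v)$. For $j = k_n + 1, \ldots, p_n$,
if $|v_j| \le \lambda_n$,

\[ \begin{aligned}
\frac{\partial Q_n(\beta + v)}{\partial v_j} &= -2\sum_{i=1}^n X_{ij}(\varepsilon_i - X_i'v) + n\lambda_n\,\mathrm{sgn}(v_j) \\
&= -2\sum_{i=1}^n X_{ij}\varepsilon_i + 2\sum_{i=1}^n X_{ij}\mathbf{X}_{i1}'v_1 + 2\sum_{i=1}^n X_{ij}\mathbf{X}_{i2}'v_2 + n\lambda_n\,\mathrm{sgn}(v_j) \\
&\equiv II_{n1,j} + II_{n2,j} + II_{n3,j} + II_{n4,j}.
\end{aligned} \]

Examine the first three terms one by one.

\[ E\Big[\max_{k_n+1\le j\le p_n}|II_{n1,j}|\Big] \le E^{1/2}\Big[\sum_{j=k_n+1}^{p_n} II_{n1,j}^2\Big] = 2\sqrt{nm_n}\,\sigma, \]

\[ \begin{aligned}
\max_{k_n+1\le j\le p_n}|II_{n2,j}| &= 2\max_{k_n+1\le j\le p_n}\Big|\sum_{i=1}^n X_{ij}\mathbf{X}_{i1}'v_1\Big| \\
&\le 2\|v_1\|\max_{k_n+1\le j\le p_n}\sqrt{(X_{\cdot j})'X_1X_1'X_{\cdot j}} \\
&\le 2\|v_1\|\max_{k_n+1\le j\le p_n}\|X_{\cdot j}\|\,\rho_{\max}^{1/2}(X_1X_1') \\
&= 2\|v_1\|\max_{k_n+1\le j\le p_n}\|X_{\cdot j}\|\,\rho_{\max}^{1/2}(X_1'X_1) \\
&= 2n\sqrt{\pi_{n,k_n}}\,\|v_1\|,
\end{aligned} \]

\[ \begin{aligned}
\max_{k_n+1\le j\le p_n}|II_{n3,j}| &= 2\max_{k_n+1\le j\le p_n}\Big|\sum_{i=1}^n X_{ij}\mathbf{X}_{i2}'v_2\Big| \\
&\le 2\|v_2\|\max_{k_n+1\le j\le p_n}\|X_{\cdot j}\|\,\rho_{\max}^{1/2}(X_2'X_2) \\
&= 2n\sqrt{\omega_{n,m_n}}\,\|v_2\|.
\end{aligned} \]

Following the above argument we have

\[ P\Big(\bigcup_{k_n+1\le j\le p_n}\{|II_{n1,j}| > |II_{n4,j}| - |II_{n2,j}| - |II_{n3,j}|\}\Big)
\le \frac{2\sqrt{nm_n\sigma^2}}{n\lambda_n - 2n(\sqrt{\pi_{n,k_n}}\|v_1\| + \sqrt{\omega_{n,m_n}}\|v_2\|)}. \]

When (A3) holds, $\sqrt{n}\lambda_n/\sqrt{m_n} \to \infty$. Under (A1)-(A2), $\|v\| = O_p(\sqrt{p_n/n}/\rho_{n,1})$.
Therefore

\[ P\Big(\bigcup_{k_n+1\le j\le p_n}\{|II_{n1,j}| > |II_{n4,j}| - |II_{n2,j}| - |II_{n3,j}|\}\Big) \to 0 \quad \text{as } n \to \infty. \]

This indicates that with probability tending to 1, for all $j = k_n + 1, \ldots, p_n$, the sign
of $\partial Q_n(\beta + v)/\partial v_j$ is the same as that of $v_j$, provided that $|v_j| \le \lambda_n$, which further implies that

\[ \lim_{n\to\infty} P(\hat\beta_{2n} = 0_{m_n}) = 1. \qquad \Box \]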
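Proof of Theorem 4. Follow the argument in the proof of Theorem 3. Note that in
the random design setting, under (B1.a),

\[ \begin{aligned}
\max_{k_n+1\le j\le p_n}|II_{n2,j}| &= 2\max_{k_n+1\le j\le p_n}\Big|\sum_{i=1}^n X_{ij}\mathbf{X}_{i1}'v_1\Big| \\
&\le 2\|v_1\|\max_{k_n+1\le j\le p_n}\sqrt{(X_{\cdot j})'X_1X_1'X_{\cdot j}} \\
&\le 2\|v_1\|\max_{k_n+1\le j\le p_n}\|X_{\cdot j}\|\,\rho_{\max}^{1/2}(X_1X_1') \\
&\le 2\|v_1\|\sqrt{n}M\rho_{\max}^{1/2}(X_1'X_1) \\
&\le 2Mn\|v_1\|\big[\rho_{\max}(E[\mathbf{X}_{i1}\mathbf{X}_{i1}']) + \|A^{(n)}\|\big]^{1/2} \\
&\le 2n\|v_1\|\big(\sqrt{\pi_{k_n}} + \|A^{(n)}\|^{1/2}\big)O_p(1) \\
&\le 4n\|v_1\|\sqrt{\pi_{k_n}}\,O_p(1)
\end{aligned} \]

for sufficiently large $n$. Similarly

\[ \max_{k_n+1\le j\le p_n}|II_{n3,j}| \le 4n\|v_2\|\sqrt{\omega_{m_n}}\,O_p(1). \]

The rest of the argument is identical to that in the fixed design case and thus
omitted here. $\Box$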

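Proof of Theorem 5. During the course of proving Lemma 1, we have under (A0)-
(A1), $\|\hat\beta_n - \beta\| = O_p(\lambda_n\sqrt{k_n/\rho_{n,1}} + \sqrt{p_n/(n\rho_{n,1})})$. Under (A2), this implies that

\[ \|\hat\beta_{1n} - \beta_1\| = o_p\Big(\min_{1\le j\le k_n}|\beta_j|\Big). \]

Also from (A2), $\lambda_n = o(\min_{1\le j\le k_n}|\beta_j|)$. Therefore, with probability tending to 1,
all the $\hat\beta_j$ $(1 \le j \le k_n)$ are bounded away from $[-a\lambda_n, a\lambda_n]$ and so the partial
derivatives exist. At the same time, $\hat\beta_{2n} = 0_{m_n}$ with probability tending to 1.
Thus with probability tending to 1, the stationarity condition holds for the first $k_n$
components. That is, $\hat\beta_{1n}$ necessarily satisfies the equation

\[ \sum_{i=1}^n (Y_i - \mathbf{X}_{i1}'\hat\beta_{1n})\mathbf{X}_{i1} = 0, \quad \text{i.e.} \quad
\sum_{i=1}^n \varepsilon_i\mathbf{X}_{i1} = \sum_{i=1}^n \mathbf{X}_{i1}\mathbf{X}_{i1}'(\hat\beta_{1n} - \beta_1). \]

So the random vector being considered

\[ \begin{aligned}
Z_n &= \sqrt{n}\,\Sigma_n^{-1/2}A_n(\hat\beta_{1n} - \beta_1) \\
&= n^{1/2}\Sigma_n^{-1/2}A_n(X_1'X_1)^{-1}\sum_{i=1}^n \mathbf{X}_{i1}\varepsilon_i \\
&= n^{-1/2}\sum_{i=1}^n R_i^{(n)}\varepsilon_i,
\end{aligned} \]
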




J. Huang and H. Xie 



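where $R_i^{(n)} = \Sigma_n^{-1/2}A_n(n^{-1}X_1'X_1)^{-1}\mathbf{X}_{i1}$. The equality holds with probability
tending to 1. $\max_{1\le i\le n}\|R_i^{(n)}\|/\sqrt{n} \to 0$ is implied by (A4), as can be seen from

\[ n^{-1}\|R_i^{(n)}\|^2 \le \sigma^{-2}\,\mathbf{X}_{i1}'\Big(\sum_{i=1}^n \mathbf{X}_{i1}\mathbf{X}_{i1}'\Big)^{-1}\mathbf{X}_{i1}. \]

Therefore for any $\xi > 0$,

\[ \begin{aligned}
\frac{1}{n}\sum_{i=1}^n E\big[\|R_i^{(n)}\varepsilon_i\|^2\,1\{\|R_i^{(n)}\varepsilon_i\| > \sqrt{n}\xi\}\big]
&= \frac{1}{n}\sum_{i=1}^n R_i^{(n)\prime}R_i^{(n)}E\big[\varepsilon_i^2\,1\{|\varepsilon_i| > \sqrt{n}\xi/\|R_i^{(n)}\|\}\big] \\
&\le \frac{1}{n}\sum_{i=1}^n R_i^{(n)\prime}R_i^{(n)}E\big[\varepsilon_1^2\,1\{|\varepsilon_1| > \sqrt{n}\xi/\max_{1\le i\le n}\|R_i^{(n)}\|\}\big] \\
&= \frac{1}{n}\sum_{i=1}^n R_i^{(n)\prime}R_i^{(n)}\,o(1) \\
&= \frac{1}{n}\sum_{i=1}^n \mathbf{X}_{i1}'(n^{-1}X_1'X_1)^{-1}A_n'\Sigma_n^{-1}A_n(n^{-1}X_1'X_1)^{-1}\mathbf{X}_{i1}\,o(1) \\
&= \frac{1}{n}\sum_{i=1}^n \mathrm{tr}\big\{(n^{-1}X_1'X_1)^{-1}A_n'\Sigma_n^{-1}A_n(n^{-1}X_1'X_1)^{-1}\mathbf{X}_{i1}\mathbf{X}_{i1}'\big\}\,o(1) \\
&= \mathrm{tr}\big\{(n^{-1}X_1'X_1)^{-1}A_n'\Sigma_n^{-1}A_n\big\}\,o(1) \\
&= \mathrm{tr}\Big\{\Sigma_n^{-1}A_n\Big(\frac{1}{n}\sum_{i=1}^n \mathbf{X}_{i1}\mathbf{X}_{i1}'\Big)^{-1}A_n'\Big\}\,o(1) \\
&= o(1)\,d.
\end{aligned} \]

So

\[ Z_n \to_D N(0_d, I_d) \]

follows from the Lindeberg-Feller central limit theorem and $\mathrm{Var}(Z_n) = I_d$. $\Box$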
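Proof of Theorem 6. The vector being considered

\[ n^{-1/2}\,\Sigma_n^{-1/2}A_nE^{-1/2}[\mathbf{X}_{i1}\mathbf{X}_{i1}']\sum_{i=1}^n \mathbf{X}_{i1}\mathbf{X}_{i1}'(\hat\beta_{1n} - \beta_1)
= \frac{1}{\sqrt{n}}\sum_{i=1}^n \Sigma_n^{-1/2}A_nE^{-1/2}[\mathbf{X}_{i1}\mathbf{X}_{i1}']\,\varepsilon_i\mathbf{X}_{i1} \]

with probability tending to 1. Let $Z_{ni} = n^{-1/2}\Sigma_n^{-1/2}A_nE^{-1/2}[\mathbf{X}_{i1}\mathbf{X}_{i1}']\mathbf{X}_{i1}\varepsilon_i$, $i = 1, \ldots, n$.
$\{Z_{ni}, n = 1, 2, \ldots, i = 1, \ldots, n\}$ form a triangular array and within each
row, they are i.i.d. random vectors. First,

\[ \mathrm{Var}\Big(\sum_{i=1}^n Z_{ni}\Big) = \mathrm{Var}\big(\Sigma_n^{-1/2}A_nE^{-1/2}[\mathbf{X}_{i1}\mathbf{X}_{i1}']\mathbf{X}_{11}\varepsilon\big) = I_d. \]

Second, under (B1.a),

\[ \sum_{i=1}^n E\big[\|Z_{ni}\|^2\,1\{\|Z_{ni}\| > \xi\}\big] = nE\big[\|Z_{n1}\|^2\,1\{\|Z_{n1}\| > \xi\}\big]
\le nE^{1/2}[\|Z_{n1}\|^4]\,P^{1/2}(\|Z_{n1}\| > \xi) = o(1), \]

since

\[ \begin{aligned}
E^{1/2}[\|Z_{n1}\|^4] &= E^{1/2}[(Z_{n1}'Z_{n1})^2] \\
&= \frac{1}{n}E^{1/2}\Big[\varepsilon^4\big(\mathbf{X}_{11}'E^{-1/2}[\mathbf{X}_{11}\mathbf{X}_{11}']A_n'\Sigma_n^{-1}A_nE^{-1/2}[\mathbf{X}_{11}\mathbf{X}_{11}']\mathbf{X}_{11}\big)^2\Big] \\
&\le \frac{1}{n}\sigma_4^{1/2}\rho_{\max}(A_n'\Sigma_n^{-1}A_n)\,\rho_{\max}(E^{-1}[\mathbf{X}_{11}\mathbf{X}_{11}'])\,E^{1/2}\big[(\mathbf{X}_{11}'\mathbf{X}_{11})^2\big] \\
&\le \frac{1}{n}\sigma_4^{1/2}\rho_{\max}(\Sigma_n^{-1}A_nA_n')\,\rho_1^{-1}E^{1/2}\Big[\Big(\sum_{j=1}^{k_n}X_{1j}^2\Big)^2\Big] \\
&= O\Big(\frac{k_n}{n\rho_1}\Big)
\end{aligned} \]

and

\[ P(\|Z_{n1}\| > \xi) \le \frac{E\|Z_{n1}\|^2}{\xi^2} = \frac{d}{n\xi^2}. \]

Thus by the Lindeberg-Feller central limit theorem we have

\[ n^{-1/2}\,\Sigma_n^{-1/2}A_nE^{-1/2}[\mathbf{X}_{i1}\mathbf{X}_{i1}']X_1'X_1(\hat\beta_{1n} - \beta_1) \to_D N(0_d, I_d). \qquad \Box \]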